Chapter 12
Comparing Proportions and Analyzing Cross-Tabulations
IN THIS CHAPTER
Testing for association between categorical variables with the Pearson chi-square and Fisher exact tests
Estimating sample sizes for tests of association
Suppose that you are studying pain relief in patients with chronic arthritis. Some are taking
nonsteroidal anti-inflammatory drugs (NSAIDs), many of which are available as over-the-counter pain
medications. But others are trying cannabidiol (CBD), a potential new natural treatment for arthritis pain. You enroll
100 chronic arthritis patients in your study and you find that 60 participants are using CBD, while the
other 40 are using NSAIDs. You survey them to see if they get adequate pain relief. Then you record
what each participant says (pain relief or no pain relief). Your data file has two dichotomous
categorical variables: the treatment group (CBD or NSAIDs), and the outcome (pain relief or no pain
relief).
You find that 10 of the 40 participants taking NSAIDs reported pain relief, which is 25 percent. But 33
of the 60 taking CBD reported pain relief, which is 55 percent. CBD appears to increase the
percentage of participants experiencing pain relief by 30 percentage points. But can you be sure this
isn’t just a random sampling fluctuation?
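The arithmetic behind these percentages can be checked with a few lines of Python. This is only a minimal sketch using the counts given above; the names `counts` and `proportion_with_relief` are made up for illustration, and the sketch does not address whether the difference is statistically significant.

```python
# Counts taken from the example in the text.
counts = {
    "NSAIDs": {"relief": 10, "total": 40},
    "CBD": {"relief": 33, "total": 60},
}


def proportion_with_relief(group):
    """Return the fraction of a group's participants who reported pain relief."""
    return counts[group]["relief"] / counts[group]["total"]


p_nsaid = proportion_with_relief("NSAIDs")
p_cbd = proportion_with_relief("CBD")

# Difference between the two proportions, expressed in percentage points.
difference_points = (p_cbd - p_nsaid) * 100

print(f"NSAIDs: {p_nsaid:.0%}")
print(f"CBD: {p_cbd:.0%}")
print(f"Difference: {difference_points:.0f} percentage points")
```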
Data from two potentially associated categorical variables is summarized as a cross-tabulation,
which is also called a cross-tab or a two-way table. Because we are studying the association between
two variables, this is a form of bivariate analysis. The rows of the cross-tab represent the different
categories (or levels) of one variable, and the columns represent the different levels of the other
variable. Each cell of the table contains the number of participants with the indicated levels
for the row and column variables. If one variable can be thought of as the “cause” or “predictor” of the
other, the cause variable becomes the rows, and the “outcome” or “effect” variable becomes the
columns. If the cause and outcome variables are both dichotomous, meaning they have only two levels
(as in this example), then the cross-tab has two rows and two columns. This structure has four cells
of counts and is referred to as a 2-by-2 (or 2 × 2) cross-tab, or a fourfold table. Cross-tabs are
typically displayed with an extra row at the bottom and an extra column at the right to contain the sums
of the cells in the rows and columns of the table. These sums are called marginal totals, or just
marginals.
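As an illustration, the fourfold table for the arthritis example, including its marginal totals, could be assembled as in the sketch below. It assumes the treatment group is the row ("cause") variable and the outcome is the column variable. The "no pain relief" counts are not stated directly in the text; they are derived by subtraction (40 − 10 = 30 and 60 − 33 = 27). All variable names are hypothetical.

```python
# Rows are the "cause" variable (treatment); columns are the outcome.
# "No pain relief" counts are derived: 40 - 10 = 30 and 60 - 33 = 27.
cells = {
    "NSAIDs": {"Pain relief": 10, "No pain relief": 30},
    "CBD": {"Pain relief": 33, "No pain relief": 27},
}
columns = ["Pain relief", "No pain relief"]

# Marginal totals: sums across each row, sums down each column, and the grand total.
row_totals = {row: sum(by_col.values()) for row, by_col in cells.items()}
column_totals = {col: sum(cells[row][col] for row in cells) for col in columns}
grand_total = sum(row_totals.values())


def format_row(label, values):
    """Lay out one table row with a left-aligned label and right-aligned values."""
    return f"{label:<10}" + "".join(f"{v:>16}" for v in values)


# Print the cross-tab with the extra "Total" column at the right and row at the bottom.
print(format_row("", columns + ["Total"]))
for row, by_col in cells.items():
    print(format_row(row, [by_col[c] for c in columns] + [row_totals[row]]))
print(format_row("Total", [column_totals[c] for c in columns] + [grand_total]))
```

The row marginals reproduce the group sizes (40 and 60), and the grand total matches the 100 enrolled participants, which is a quick consistency check on the cell counts.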
Comparing proportions based on a fourfold table is the simplest example of testing the association
between two categorical variables. More generally, the variables can have any number of categories,
so the cross-tab can be larger than 2 × 2, with more than two rows or columns. But the basic question
to be answered is always the same: Is the spread of numbers across the columns so different from
one row to the next that the numbers can’t be explained away as random fluctuations? Another way